
Control Hopper Env, Stand + Hop #440

Merged: 18 commits into haosulab:main, Jul 26, 2024
Conversation

@Xander-Hinrichsen (Collaborator) commented Jul 18, 2024

  1. Added the hopper xml, the hopper environment, and the hopper stand and hop tasks:
    MS-HopperStand-v1 & MS-HopperHop-v1

  2. Made small changes to the following files:
    mani_skill/envs/sapien_env.py - added the ability to mount a render-only camera to a robot articulation link. This lets the PPO evaluation videos follow the robot articulation (necessary to view the hopper and its movement throughout the episode). It makes no difference for _human_render_cameras if entity_uid isn't specified in the camera config, as is the case for all previous environments.
    mani_skill/utils/building/ground.py - added additional kwargs to the ground builder to allow rectangular grounds (no longer just square) and ground xy offsets (both needed to reproduce the dm_control/mujoco planar-agent ground visual). The API is unchanged if these kwargs aren't specified, as is the case for all previous environments.
    index - task documentation

  3. Made large changes to the following files:
    mani_skill/envs/utils/rewards/common.py - value_at_margin wasn't implemented, and the sigmoidal reward fns weren't guaranteed to return rewards between 0 and 1! Made changes so that it is equivalent to dm_control's reward tolerance (see the sketch below).
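For reference, a minimal sketch of dm_control-style reward tolerance with a gaussian sigmoid (batched torch tensors assumed). This only illustrates the behavior described above; it is not the exact code now in mani_skill/envs/utils/rewards/common.py:

import torch

def tolerance(x, lower=0.0, upper=0.0, margin=0.0, value_at_margin=0.1):
    """Returns 1 inside [lower, upper]; outside, decays smoothly toward 0,
    equaling value_at_margin at a distance of `margin` from the nearest bound."""
    in_bounds = torch.logical_and(lower <= x, x <= upper)
    if margin == 0:
        return in_bounds.float()
    # normalized distance outside the bounds
    d = torch.where(x < lower, lower - x, x - upper) / margin
    # gaussian sigmoid: equals value_at_margin exactly at d == 1, stays in (0, 1]
    scale = torch.sqrt(-2 * torch.log(torch.tensor(value_at_margin)))
    value = torch.exp(-0.5 * (d * scale) ** 2)
    return torch.where(in_bounds, torch.ones_like(value), value)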

These changes make it much easier to implement other planar agents (like walker and cheetah), but in my experience there are issues (like known and documented bugs in mjcf_loader.py and joint tuning difficulty) that make those other environments harder to implement.

image

image

The environment requires further tuning to more closely resemble the dm_control hopper.

5.mp4
2.mp4

args to recreate:
Hop
for i in {1..3}; do python ppo.py --exp_name="hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.999 --seed=${i}; done
Stand
for i in {1..3}; do python ppo.py --exp_name="standseed${i}" --env_id="MS-HopperStand-v1" --num_envs=2048 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.99 --seed=${i}; done
HopRGB singleseed
python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=1024 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=100 --num_steps=100 --gamma=0.99 --seed=1

# dm_control also includes foot pressures as state obs space
def _get_obs_extra(self, info: Dict):
    obs = dict(
        toe_touch=self.touch("foot_toe"),
Member:

Does dm_control with visual data input ever use state information? This data will be part of the observation (along with _get_obs_agent()).

Collaborator Author:

I am now explicitly overriding _get_obs_state_dict to include the entire state (touch + proprioception), so the touch information isn't included in the rgbd obs mode.
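Roughly speaking (a sketch only, assuming the base class composes the full state dict from these pieces; not the exact code in this PR):

def _get_obs_state_dict(self, info: Dict):
    # state obs modes get proprioception plus the touch sensor readings;
    # rgbd obs modes never go through this override, so touch stays out of them
    return dict(
        agent=self._get_obs_agent(),
        extra=dict(toe_touch=self.touch("foot_toe")),
    )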

Collaborator Author:

dm_control uses a pixel wrapper, which allows the user to choose either the entire state + pixels or only pixels.

Member:

Ok looking at what seems to be standard practice, DrQ and TDMPC both do not use state at all in training, e.g. https://github.com/nicklashansen/tdmpc2/blob/5f6fadec0fec78304b4b53e8171d348b58cac486/tdmpc2/envs/wrappers/pixels.py#L8

Given the tasks are not that hard, we can do image-only PPO as well. The way I will do it is to fix the flatten rgbd observation wrapper to be like the pixel wrapper: you just choose if you want rgb, depth, and/or state, and it will handle it accordingly.
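A hypothetical usage sketch of that wrapper behavior (the wrapper name and keyword flags here are assumptions based on the description above, not a confirmed API):

import gymnasium as gym
import mani_skill.envs  # assumed: registers the MS-* environments
from mani_skill.utils.wrappers import FlattenRGBDObservationWrapper  # assumed import path

env = gym.make("MS-HopperHop-v1", obs_mode="rgbd")
# choose any combination of rgb / depth / state; here: pixels only, like dm_control's pixel wrapper
env = FlattenRGBDObservationWrapper(env, rgb=True, depth=False, state=False)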

Member:

Re this, merge the main branch into your PR. I have updated the code to do exactly this. The new ppo_rgb script is:

python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=50 --gamma=0.99 --seed=1 --no_include_state

(the hyperparams might not be well tuned).

    found_lost_pairs_capacity=2**25, max_rigid_patch_count=2**18
),
scene_cfg=SceneConfig(
    solver_position_iterations=15, solver_velocity_iterations=15
Member:

Why these configs for the solver? There is no need to try to match mujoco's solver configuration, since mujoco is a different physics engine from the one we use, which is PhysX.

@Xander-Hinrichsen (Collaborator Author), Jul 23, 2024:

Changed solver_position_iterations from the default of 15 to 4 for sim efficiency (15 was unnecessary); solver_velocity_iterations stays at the default value of 1.

    solver_position_iterations=15, solver_velocity_iterations=15
),
# sim_freq=50,
control_freq=25,
Member:

How is control_freq decided here? Was it to try to match dm_control?

@Xander-Hinrichsen (Collaborator Author), Jul 23, 2024:

control_freq was set so that the ratio of sim_freq to control_freq is 4:1 (same as the dm_control hopper); the code now explicitly sets sim_freq=100.
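Put together, the sim configuration described in these two replies looks roughly like this (SimConfig/SceneConfig field names are taken from the diff context above; treat the exact structure as an assumption):

sim_config = SimConfig(
    sim_freq=100,      # physics steps per second
    control_freq=25,   # 4 physics steps per control step, matching dm_control hopper's 4:1 ratio
    scene_cfg=SceneConfig(
        solver_position_iterations=4,  # lowered from the default of 15 for sim efficiency
        solver_velocity_iterations=1,  # default
    ),
)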

uid="cam0",
pose=Pose.create_from_pq([0, -2.8, 0], euler2quat(0, 0, np.pi / 2)),
width=128,
height=128,
Member:

is 128x128 correct? I recall dm-control uses smaller images

Collaborator Author:

I found that the images are 240 in height, 320 in width:

import dm_control.suite as suite
from dm_control.suite.wrappers import pixels
import numpy as np

env = suite.load("hopper", "hop")
env_and_pixels = pixels.Wrapper(env, pixels_only=True)

_, _, _, obs = env_and_pixels.reset()

print(obs['pixels'].shape) # prints (240, 320, 3)

this seems large, should we use the same dims?

Member:

Actually turns out people seem to use different resolutions. Let's just stick with 128x128 for now. It runs fast and is sufficient.

Just checked TDMPC2 code and they use 384x384. So not sure how this is picked usually.

for a in actor_builders:
    a.build(a.name)

# load in the ground
Member:

For hopper (and also cartpole), I recommend we add a massive long white wall behind, with color 0.3, 0.3, 0.3, so that the rendered videos look a bit nicer (this will match what ray-traced videos look like, since they include the ambient lighting, which is 0.3, 0.3, 0.3). Moreover, with a big background wall, anyone who wants to use depth image data can, and it won't be full of 0s (indicating no depth info since it goes to infinity).
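A hedged sketch of such a backdrop (builder method names assumed from SAPIEN/ManiSkill's actor builder API; the sizes and pose are purely illustrative):

import sapien

builder = self.scene.create_actor_builder()
builder.add_box_visual(
    half_size=[30.0, 0.1, 10.0],  # a long, thin wall spanning the plane of motion
    material=sapien.render.RenderMaterial(base_color=[0.3, 0.3, 0.3, 1.0]),  # matches ambient light
)
builder.initial_pose = sapien.Pose(p=[0, 2.0, 5.0])  # a few meters behind the agent
self.wall = builder.build_static(name="backdrop_wall")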

@Xander-Hinrichsen (Collaborator Author), Jul 23, 2024:

Created a new scene builder in mani_skill/utils/scene_builder/control/planar, for re-use with the other planar agents.
New eval visual (512x512):

6.mp4

Collaborator Author:

Added the wall to cartpole.py directly, since it doesn't fit the planar scene builder or warrant its own scene builder.

2.mp4

Member:

ok we might want to play around with lighting in the future to make these look a bit brighter, but no need for now.

@Xander-Hinrichsen (Collaborator Author) commented Jul 23, 2024

Modified the xml file's hip range, because the original hopper robot is bugged: its hip range allows self-intersection, which gets the robot stuck in a certain position with small probability on initialization:

dm_control_hopper_bug

@StoneT2000 (Member):

Is the ppo_rgb.py code updated? I seem to run into a few errors due to evaluating 100 steps instead of the max 600.

@Xander-Hinrichsen (Collaborator Author) commented Jul 23, 2024

> Is the ppo_rgb.py code updated? I seem to run into a few errors due to evaluating 100 steps instead of the max 600.

Yes, I added a 1-line change to ppo_rgb.py to get it to work in the new commit: there was no check that success exists in the final_info dict when writing to tensorboard in ppo_rgb.py, while there is such a check in ppo.py.
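For context, the guard is the kind of thing sketched below (variable names like final_info and writer are illustrative of the cleanrl-style PPO scripts, not the exact diff):

# control tasks such as the hopper have no success signal, so only log it when present
if "success" in final_info:
    writer.add_scalar(
        "charts/success_rate",
        final_info["success"].float().mean().item(),
        global_step,
    )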

for i in {1..3}; do python ppo_rgb.py --exp_name="pixels_hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=256 --update_epochs=8 --num_minibatches=8 --total_timesteps=25_000_000 --num-steps=100 --num_eval_steps=600 --gamma=0.995 --seed=${i}; done

image

performance is worse without the pressure sensor information

@StoneT2000 (Member) commented Jul 23, 2024

It is hard to tell if it's "worse" since this is a different return scale now? Also I tried running again, but the performance after 5M steps looks like this:

9.mp4

Looks like it learns to flip forward to go fast now, instead of the original behavior.

@StoneT2000 (Member) commented Jul 23, 2024

My script:

python ppo_rgb.py --exp_name="rgbhopseed1" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=20_000_000 --eval_freq=15 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --seed=1 --no-include-state

(merge main for this to work)
This seems to perform better than the script you ran with these hyperparams. It does show some sign that it might learn the "right" behavior (it could also be that our hopper setup affords more optimal flipping than mujoco's).

image

Maybe the flipping behavior will go away if the agent does not have so much power behind its actions. I can try to experiment as well.

@Xander-Hinrichsen (Collaborator Author):

This has been my experience: the damping, stiffness, and delta are very important for the desired behavior. If the damping especially is too low, then the hopper is unable to upright itself; if it's too high, then it can instead flip to its feet too easily.
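For illustration, these are the knobs being discussed (a hypothetical controller config sketch; the class name, fields, joint names, and numbers are all assumptions, not the tuned values from this PR):

pd_joint_delta_pos = PDJointPosControllerConfig(
    joint_names=["waist", "hip", "knee", "ankle"],  # hypothetical hopper joint names
    lower=-0.5, upper=0.5,  # the "delta": max change in joint target per control step
    stiffness=100,          # proportional gain of the PD controller
    damping=10,             # too low: hopper can't right itself; too high: it flips upright too easily
    use_delta=True,         # assumed flag selecting the delta variant
)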

@StoneT2000 (Member):

Can you somehow revert to the settings that got the first video that was working? Or was it only working from state inputs?

@Xander-Hinrichsen (Collaborator Author) commented Jul 24, 2024

It was only working from state inputs; these are the same controller settings that produced the first videos.

@StoneT2000 (Member) commented Jul 24, 2024

Arguably though, with the flipping it seems like it achieves the same distance, if not further, than the standard "behavior". I would then say this is more a result of the reward function. Since it's not possible to get the exact same simulation as mujoco, maybe we can just penalize flipping (penalizing the x/y axis rotation velocity should be sufficient, whichever is the axis pointing toward the camera). It's fine if our reward function is slightly different from mujoco's.

I find many other repos tackling locomotion, like Isaac Lab / Orbit, often add reward terms not meant to make the robot run faster but to produce a smoother gait, which is essentially what we will be doing here.

@Xander-Hinrichsen (Collaborator Author):

Okay, that sounds good. I will try some experiments with a reward that discourages flipping.

@Xander-Hinrichsen (Collaborator Author) commented Jul 24, 2024

> Maybe the flipping behavior will go away if the agent does not have so much power behind its actions. I can try to experiment as well.

No need to experiment; I have found that y-axis rotation bounds for the reward work well. I will run tests and report here later tonight.

edit/update: it worked well for a quick check on state obs, to ensure no flipping with large deltas. However, I've tried many, many different things, and unfortunately no additional reward fixes the problem. In fact, I believe the flipping was originally a local minimum, where hopping would result in greater reward, since the flipping doesn't always activate the standing-reward portion of the hopping reward.

Currently testing out non delta joint pos.

@StoneT2000 (Member):

> Currently testing out non delta joint pos.

What do you mean by non delta joint pos? All of them should use that controller.

@Xander-Hinrichsen (Collaborator Author) commented Jul 25, 2024

> What do you mean by non delta joint pos? All of them should use that controller.

I meant that I tried pd_joint_pos instead of pd_joint_delta_pos; it didn't work though.

I've diagnosed the problem:
Originally, I tried to implement the robot's center-of-mass x velocity, but it didn't work, so I just used the x velocity of the slider joint holding the hopper in the xz plane (see the current implementation of the subtreelinvelx function).

Since the entire robot's center-of-mass x velocity isn't taken into consideration (only the torso x velocity, which is attached to the slider), the robot isn't penalized for flipping in a large circle as long as its torso continues to move in the +x world-frame direction.

Here is the original robot center-of-mass velocity function, which is slow and doesn't work:

    @property  # dm_control mjc function adapted for maniskill
    def subtreelinvelx(self):
        """Returns linvel x component of robot"""
        total_mass = 0.0
        vels = []
        links = self.agent.robot.get_links()
        # skip first three dummy links
        for i in range(3,len(links)):
            vel = links[i].get_linear_velocity() # vel is (b,3)
            mass = links[i].get_mass()[0]
            vels.append(vel * mass.item())
            total_mass += mass.item()
        vels = torch.stack(vels, dim=0) # (#links, b, 3)
        com_vel = vels.sum(dim=0) / total_mass # (b, 3)
        return com_vel[:,0]

It seems like the get_linear_velocity() function doesn't return the center-of-mass linear velocity of the links, which is causing the issue.

If calculating the center-of-mass velocity for an entire robot is non-trivial, this is an issue for the hopper + all planar + all humanoid dm_control task implementations, since they use com velocity as reward.

@StoneT2000 (Member):

Lemme get back to you. I think I also just found a bug with the GPU sim linear and angular velocities. CPU sim is correct, but on GPU sim they might be flipped. Working with @fbxiang now to debug it.

@StoneT2000 (Member) commented Jul 25, 2024

Actually you can use the torso link. It should work. That is the link they are reading the center of mass velocities from. All velocity data is center of mass actually.

e.g.

link = self.agent.robot.links_map["torso"]
print(link.linear_velocity, link.angular_velocity)

Also merge in main before doing more experiments. I just fixed #450, merged in #452, so the link velocities used for com velocities are wrong lol.

Thinking about it more, maybe it is also better to just penalize the y-axis rotation instead of the velocity (it's okay to go fast, just don't flip over).
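A hedged sketch of that alternative, assuming the planar hopper only rotates about the world y axis so the torso quaternion is approximately (w, 0, sin(theta/2), 0); hop_reward and the bounds are illustrative, and the tolerance helper is the dm_control-style one sketched earlier in this thread:

q = self.agent.robot.links_map["torso"].pose.q   # (b, 4) wxyz quaternions
pitch = 2 * torch.atan2(q[..., 2], q[..., 0])    # signed rotation about y, in radians
upright = tolerance(pitch, lower=-0.8, upper=0.8, margin=1.0)  # 1 when roughly upright, decays when flipped
reward = hop_reward * upright                    # going fast is fine, flipping over is not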

@StoneT2000 (Member) left a comment:

Small change requested that I didn't catch earlier. See other comments above.

@Xander-Hinrichsen (Collaborator Author) commented Jul 26, 2024

The results have large variance because, in both the hop and stand tasks, in seed 2 PPO learns to straighten its legs out too early and doesn't explore tucking them in when the hopper is on its belly in order to get up; PPO then gets stuck at this local optimum in seed 2. This is a result of a lack of exploration in PPO, not a fault of the environment, as shown below.

Standing results and command:

7.mp4

image

for i in {1..3}; do python ppo.py --exp_name="standseed${i}" --env_id="MS-HopperStand-v1" --num_envs=4096 --update_epochs=8 --num_minibatches=32 --total_timesteps=30_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --ent_coef=0.01 --seed=${i}; done

Hopping results and command:

12.mp4

image

for i in {1..3}; do python ppo.py --exp_name="hopseed${i}" --env_id="MS-HopperHop-v1" --num_envs=4096 --update_epochs=8 --num_minibatches=32 --total_timesteps=60_000_000 --eval_freq=10 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --ent_coef=0.01 --seed=${i}; done

RGB Hop results and command:

52.mp4

image

python ppo_rgb.py --exp_name="pixelhop" --env_id="MS-HopperHop-v1" --num_envs=512 --update_epochs=8 --num_minibatches=32 --total_timesteps=40_000_000 --eval_freq=15 --num_eval_steps=600 --num_steps=100 --gamma=0.99 --seed=1 --no-include-state

Local Optima in PPO training

Seed 2 PPO local optimum:
PPO learns to straighten its legs out no matter what, and doesn't overcome this local optimum during training in seed 2.

7.mp4

Seeds 1 and 3 don't have this issue:

seed 1 stand:

7.mp4

seed 3 stand:

7.mp4

And even if the hopper learns to flip, it is a local minimum: it will eventually learn that hopping results in more reward than flipping, because the center-of-mass reward over the entire robot body correctly rewards hopping more than flipping.

Seed 3 of the hopping command above first learns to flip, then later learns how to hop to maximize the reward, eventually forgetting the flipping behavior.
Remnant of the flipping behavior shown during the middle of training:

9.mp4

@StoneT2000 StoneT2000 merged commit ebf0282 into haosulab:main Jul 26, 2024